Within-host diversity of SARS-CoV-2 lineages and effect of vaccination

Viral and host factors can shape SARS-CoV-2 within-host viral diversity and virus evolution. However, little is known about lineage-specific and vaccination-specific mutations that occur within individuals. Here we analysed deep sequencing data from 2,146 SARS-CoV-2 samples with different viral lineages to describe the patterns of within-host diversity in different conditions, including vaccine-breakthrough infections. Variant of Concern (VOC) Alpha, Delta, and Omicron samples were found to have higher within-host nucleotide diversity while being under weaker purifying selection at full genome level compared to non-VOC SARS-CoV-2 viruses. Breakthrough Delta and Omicron infections in Comirnaty and CoronaVac vaccinated individuals appeared to have higher within-host purifying selection at the full-genome and/or Spike gene levels. Vaccine-induced antibody or T cell responses did not appear to have significant impact on within-host SARS-CoV-2 evolution. Our findings suggest that vaccination does not increase SARS-CoV-2 protein sequence space and may not facilitate emergence of more viral variants.


15
The SARS-CoV-2 pandemic continues to spread globally. Despite vaccination of over 60% of 16 the world population 1 , the risk of SARS-CoV-2 reinfections and breakthrough infections is 17 increasing due to the emergence of new viral variants 2,3 . Multiple variants of concern (VOC) 18 have demonstrated the ability to evade naturally-acquired or vaccine-induced immunity 4-6 . 19 Therefore, it is crucial to investigate the impact of vaccination on mutational and evolutionary 20 processes of SARS-CoV-2. 21 22 Genomic surveillance has been used to trace the transmission and evolution of SARS-CoV-2 23 mutations at local, regional, and global scales throughout the pandemic 7-9 . However, there is 24 still limited knowledge of how these mutations originate and accumulate within hosts. Within-25 host mutations can arise through replication errors or RNA damage/editing 10 and they may be 26 subject to fixation by stochastic (genetic drift) and deterministic (natural selection) processes. 27 We and others have previously found that the SARS-CoV-2 transmission bottleneck between 28 hosts is narrow 8,11-14 , suggesting that only few virions are transferred from the host during 29 transmission. Most of the low-frequency mutations are not transmitted between patients, which 30 constrains the use of intrahost single nucleotide variants (iSNVs) for effective contact 31 tracing 12,15,16 . However, it remains important to investigate within-host diversity of SARS-32 CoV-2 to understand host-level evolutionary forces. (Delta or Omicron) infections after Comirnaty or CoronaVac vaccination were studied. Our 52 results provide insights into the variation of within-host diversity, and the mutational patterns 53 and potential selection pressures acting on viruses. 54 55

Diversity of within-host mutations in SARS-CoV-2 samples
58 59 We analysed 2,146 SARS-CoV-2 samples from 2,053 different individuals, of whom 86 had 60 multiple samples (totalling 2,053 representative samples and 93 repeated samples for technical 61 control). The samples were collected from July-2020 to March-2022, covering the third to the 62 fifth COVID-19 waves in HK. All samples had genome coverage ≥90% with >100 reads of 63 sequencing depth ( Supplementary Fig. 1)  1,801 (88%) were uniquely observed in a single patient sample. This suggests that most iSNVs 82 are sporadic mutations occurring at distinct positions rather than recurrent mutations occurring 83 at specific mutation hotspots (Fig. 1B). 84 85 We found a weak correlation between viral load (Ct value) and the number of iSNVs per Kb. 86 Samples with higher viral load (lower Ct value) generally had less iSNVs ( Fig. 1A and  87 Supplementary Fig. 2A). However, while viral load decreased with detection lag (time since 88 symptom onset) ( Supplementary Fig. 2B), the correlation between detection lag and number of 89 iSNVs was not found to be significant ( Supplementary Fig. 2C). We also found that the viral 90 load did not significantly correlate with minor allele frequency (MAF) ( Supplementary Fig.  91 2D). Consistent with other studies, these results (Supplementary Fig. 2A Fig. 2E). We found the mean number of iSNVs per Kb 98 to be 0.045 across the full genome and the highest incidence of iSNVs were found in the ORF8 99 and ORF7a genes (Supplementary Table 1).  100  101  Consistent with previous reports 23,24 , we found some mutation types (C®U, G®A, A®G,  102 U®C, and G®U) occurred with higher-than-average frequencies, measured in terms of the 103 number of synonymous/nonsynonymous iSNVs per synonymous/nonsynonymous site (i.e., 104 and ) ( Fig. 1C; points above the dashed lines). The high frequency of C®U/G®A and 105 A®G/U®C mutants support the hypothesis of RNA editing in vivo via APOlipoprotein B 106 Editing Complex (APOBEC) and Adenosine Deaminase Acting on RNA (ADAR) enzymes 10,25 , 107 respectively. Interestingly, we observed a higher mutation frequency of G®U, but a lower 108 mutation frequency of C®A, which suggests a strand bias of the G®U mutation. The G®U 109 mutation may be associated to Reactive Oxygen Species (ROS)-related processes 26 (Fig. 1D). There were some shared iSNVs, i.e., found in multiple samples from 116 different patients, with five of them (labelled in Fig. 1E) observed in more than 20 samples 117 (frequency>1%). These five high-frequent iSNVs were found in samples from more than one 118 SARS-CoV-2 lineage and under different vaccination statuses, suggesting mutation homoplasy 119 rather than shared mutations from direct transmissions (Supplementary Table 2 ORF1ab, in all cases the synonymous nucleotide diversity is higher than or similar to the non-181 synonymous nucleotide diversity ( The incidence of iSNVs and nucleotide diversity may also be affected by vaccination. By  189 studying the samples of breakthrough infections from fully vaccinated (with two-doses of 190 Comirnaty or CoronaVac vaccines) patients, we found that the incidence of iSNVs in 191 Comirnaty Delta virus samples was significantly higher than that from the unvaccinated Delta  To investigate whether the within-host mutations detected in our vaccinated samples enable 241 immune escape, we studied the overlaps of identified within-host mutations with known 242 neutralizing antibody (nAb) escape mutations in the Spike gene and with experimentally-243 determined T cell epitopes across the full genome. 244 245 Although we found mutations on the receptor-binding domain (RBD) and near the S1/S2 246 cleavage site (e.g., R683L), the overall mutations did not significantly cluster in any specific 247 regions of the Spike gene ( Fig. 4A Since the samples were sequenced from HK cases, we repeated the above analysis while 278 focusing on the epitopes associated with HLAs prevalent in the HK population (Supplementary 279 Fig. 10, Methods). As for the above results ( Fig. 5A and Supplementary Fig. 9A), we did not 280 observe a significant difference in the number of overlapping CD8 + and CD4 + T cell epitopes 281 per mutation between the vaccinated and unvaccinated samples in the full genome ( Fig. 5B) or 282 in the Spike gene ( Supplementary Fig. 9B). 283 284 While the two candidate regions of positive selection mentioned in the previous section 285 (nsp3:448-451 and E:40-47) overlapped with many CD8 + T cell epitopes (N=8 and N=5), these 286 associations did not reach statistical significance (Supplementary Table 5). Overall, we did not 287 identify a surge of antibody escape mutations in any group, and different groups had a similar 288 level of mutation rates in T cell epitope regions. 289 290

292
In this study we have analysed Illumina amplicon data from 2,146 SARS-CoV-2 samples to 293 estimate intra-host variation of SARS-CoV-2 under different conditions. Similar to earlier 294 studies, we show that incidence of iSNVs in SARS-CoV-2 samples is low (0 to 2 iSNVs per 295 sample) 11,16 and that sample viral loads negatively correlate with within-host mutation 296 rates 11,12,15,16 , which suggests low viral load specimens are prone to bias toward falsely high 297 iSNVs rates. In agreement with reports from Tonkin-Hill et al. 15 where SARS-CoV-2 samples 298 with lower Ct value show good concordance in allele frequencies between replicates, we also 299 found the cut-off of Ct ≤25 can avoid most outliers. Evidence of RNA editing at the full 300 genome level, e.g., the widely reported biased C®U/G®A and A®G/U®C pairs of 301 mutations 23,24 , was observed in our study. We also found strong strand asymmetry of G®U 302 mutations in our data, suggestive of RNA damage or RNA editing (rather than replication errors) 303 on the plus stand 15 and possible association with ROS-related processes 26  Comirnaty vaccine is markedly more immunogenic than CoronaVac vaccine and this may 336 contribute to our observation 42 . It is also relevant to note that Comirnaty vaccine only has the 337 Spike protein as an immunogen but appears to impact on purifying selection elsewhere in the 338 genome. This may be a result of greater suppression of viral replication. Crucially, additional 339 purifying selection pressures imposed by vaccination may limit the evolutionary protein 340 sequence space as non-synonymous nucleotide diversity does not seem to be increasing. 341 Overall, Comirnaty and CoronaVac vaccination seemingly amplifies the within-host purifying 342 selection in VOCs. 343 344 We did not observe enrichment of VOC defining mutations for the non-VOC samples (data not  345 shown), which suggests that convergent evolution of VOC mutations is infrequent. Only three 346 of the mutations observed in our vaccinated samples overlap with known nAb escape supersites 347 on the Spike NTD or RBD regions and the predicted RBD immune escape potential is only 348 mildly (less than 10% immune escape) affected by these mutations. Evolution of T cell epitopes 349 under selection by the host immune system has been reported for other viruses 43-45 , and T cell 350 responses to SARS-CoV-2 have also been reported in most COVID-19 patients 46 . However, 351 we did not detect significant vaccination-specific T cell pressure on within-host diversity, 352 suggesting vaccine-induced pressure may not enhance exploration of immune escape pathways. 353

354
As HK used an elimination strategy to control COVID-19, the individuals investigated in our 355 study can be reliably categorised as immunologically naïve or vaccinated individuals, which is 356 a significant advantage of our study. Nonetheless, our study has some limitations. Virus genome was reverse transcribed with multiple gene-specific primers targeting different 393 regions of the viral genome. The synthesized cDNA was then subjected to multiple overlapping 394 2-kb PCRs for full-genome amplification. PCR amplicons obtained from the same specimen 395 were pooled and sequenced by using the Novaseq or iSeq sequencing platform (Illumina). 396 Sequencing library was prepared by using Nextera XT (Illumina). Generated sequencing reads 397 were quality trimmed by fastp with parameters ("-q 30 -5 -3 -c --detect_adapter_for_pe -l 50"). 398 Potential PCR duplicates were removed by samtools markdup (v1.11). The trimmed reads were 399 mapped to a reference virus genome by using BWA-MEM2 (v2.0pre2), and genome consensus 400 was generated by using iVar (v1.3.1) with the PCR primer trimming protocol (minimum 401 sequence depth of 100 and minimum Qvalue of 30 of Spike NTD mutations (Fig. 4).

483
Acquisition of SARS-CoV-2 CD8 + and CD4 + T cell epitopes 484 We obtained SARS-CoV-2 CD8 + and CD4 + T cell epitope data from the dashboard reported 485 by us 51 (https://www.mckayspcb.com/SARS2TcellEpitopes/; accessed on 15 May 2022) and 486 the Immune Epitope Database (IEDB) 52 (https://www.iedb.org; accessed on 15 May 2022) by 487 querying for the organism name: "SARS-CoV2" (taxonomy ID: 2697049), host: "human", 488 and assay: "T cell positive". The compiled data was processed to only include epitopes with 489 lengths between 9-11 residues for CD8 + and 13-20 residues for CD4 + , which represent the 490 typical range of HLA class I and II epitopes. Removing the epitopes with no or incomplete 491 HLA allele information resulted in a total of 1,324 unique CD8 + and 961 unique CD4 + epitope-492 HLA pairs (Supplementary Fig. 6). The analysis in Fig. 5A is based on this complete set of 493 known SARS-CoV-2 T cell epitopes. 494 495 For the analysis focused on epitopes targeted by T cells in the HK population (Fig. 5B), we 496 determined class I and class II HLA alleles prevalent in HK. For class I alleles, we employed 497 the IEDB's "Population Coverage" tool (http://tools.iedb.org/population/) to identify 12 HLA 498 class I alleles that together cover >99% of the HK population ( Supplementary Fig. 10A, left 499 panel). A total of 630 unique SARS-CoV-2 CD8 + T cell epitopes were associated with these 500 alleles ( Supplementary Fig. 10A in the HK population ( Supplementary Fig. 10B, left panel). A total of 258 unique SARS-CoV-504 2 CD4 + T cell epitopes were associated with these alleles (Supplementary Fig. 10B, right panel). 505 506 Overlapping T cell epitopes per mutation 507 To study whether the within-host mutations (minor allele variants) affect the T cell response 508 generated against different SARS-CoV-2 lineages and under different vaccination status, we 509 used the metric Overlapping T cell epitopes per mutation. It is computed as the number of T 510 cell epitopes overlapping the within-host mutations observed in each group divided by the total 511 number of within-host mutations observed in that group. The T cell epitope data used in the 512 calculation of this metric was either from the complete set ( Fig. 5A) or from the set specific to 513 the HK population (Fig. 5B). 514 515

516
For bootstrapping analysis, the measurement can be taken from the same sample measured 517 repeatedly. For the other tests (e.g., Wilcoxon tests), the measurements were taken from distinct 518 samples. All the statistical tests in this study are two-sided and no adjustment for multiple 519 comparisons was performed unless specified. 520 521

522
The sequencing data used in this study can be access through NCBI Sequence Read Archive 523 ( Detailed scripts for the above analysis are available from https://github.com/Leo-Poon-528 Lab/mutations-under-sarscov2-vaccination. 529 530 Competing interests 531 None 532 533 Acknowledgments and funding sources 534 We acknowledge the technical support provided by colleagues from the Centre for PanorOmic 535 Sciences of the University of HK. We also acknowledge the Centre for Health Protection of 536 the Department of Health for providing epidemiological data for the study. The computations 537 were performed using research computing facilities offered by Information Technology 538 Services, the University of HK. The funding bodies had no role in the design of the study, the 539 collection, analysis, and interpretation of data, or writing of the manuscript. This  Significance was evaluated using Z-tests of the null hypothesis that − = 0 (10,000 713 bootstrap replicates, codon unit); P-value ≤ 0.01, ≤ 0.05 and ≤ 0.10 are labelled with "**", 714 "*" and "^", respectively. 715 716         The P-value for T cell epitope overlap is defined as the probability of observing at least the same number of overlapping epitopes in the gene's window of same length as the codon range.